Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past, submitting your results was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took me less than 15 minutes to finish a submission.

  1. Install library

For more detailed information on setting up the Kaggle API, see here and here.

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history who would otherwise either not obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
Data files overview

The HomeCredit_columns_description.csv acts as a data dictionary.

There are 7 different sources of data:

Table sizes

name                      [       rows, cols]   Size
-----------------------   -------------------   ------
application_train       : [   307,511,  122]    158 MB
application_test        : [    48,744,  121]     25 MB
bureau                  : [ 1,716,428,   17]    162 MB
bureau_balance          : [27,299,925,    3]    358 MB
credit_card_balance     : [ 3,840,312,   23]    405 MB
installments_payments   : [13,605,401,    8]    690 MB
previous_application    : [ 1,670,214,   37]    386 MB
POS_CASH_balance        : [10,001,358,    8]    375 MB


Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the competition's Data webpage and unzip the zip file into DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
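Once the archive is on disk, unpacking it can be scripted. A minimal sketch; the archive name `home-credit-default-risk.zip` is an assumption (Kaggle's usual naming), so verify it locally:

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Unzip a downloaded competition archive into dest_dir and
    return the extracted file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# Example (assumes the archive was downloaded next to DATA_DIR):
# extract_archive("../../../Data/home-credit-default-risk.zip", DATA_DIR)
```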

Imports

Data files overview

Data Dictionary

As part of the data download comes a data dictionary, named HomeCredit_columns_description.csv.


Application train

The training data has 307,511 observations, each row representing one loan and including the TARGET feature (0: loan repaid, 1: loan not repaid) along with 121 other features.

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

This process summarizes the data using statistical and visualization approaches, with the objective of focusing on the key features of the data so that it can be cleaned for training.
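A first pass typically combines describe() with a per-column missing-value report. A minimal sketch on a toy frame (the column names are illustrative, not the real schema):

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Percentage of missing values per column, sorted descending."""
    return df.isnull().mean().mul(100).sort_values(ascending=False).round(2)

toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, np.nan, 300.0, 400.0],  # one missing value (25%)
    "TARGET": [0, 1, 0, 0],                        # fully observed
})
print(toy.describe())        # statistical summary of each column
print(missing_report(toy))   # AMT_CREDIT: 25.0, TARGET: 0.0
```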

Summary of Application train

Observation:

Missing data for application train

Of the columns above with more than 55% of their data missing, the three below are the least correlated with the TARGET value, so we drop them from the table entirely.
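A hypothetical helper sketching that selection rule, assuming the 55% missing threshold from the text and using absolute correlation with TARGET as the tie-break (names and defaults are illustrative):

```python
import pandas as pd

def low_value_columns(df, target="TARGET", miss_thresh=0.55, n_drop=3):
    """Among columns missing more than miss_thresh of their values,
    return the n_drop columns least correlated (|r|) with the target."""
    miss = df.drop(columns=[target]).isnull().mean()
    candidates = miss[miss > miss_thresh].index
    corr = df[candidates].corrwith(df[target]).abs()
    return corr.nsmallest(n_drop).index.tolist()

# train = train.drop(columns=low_value_columns(train))  # apply to train and test alike
```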

Observation

Observation

As the shapes above show, we have dropped from both the train and test datasets the three columns that had a very low correlation with the dataset's target.

Distribution of the target column

Observation

Correlation with the target column

Observation

Loan Repaid Analysis Based on Gender

Observation

Applicants age and whether they repaid loan or not

Observation

- In general, most of the clients are in the age group of roughly 35-45 years

Applicants Housing Situation

Loan repayment based on Car Ownership and Rental Property

Observation

Repayment of Loan Based on the Loan Type

Observation

Background of the Applicant

Education

Organisation

Occupation

Distribution Plots for AMT columns

AMT Credit

AMT Annuity

AMT Goods Price

Summary of Bureau

Information about the bureau dataset

Statistical summary of the bureau dataset

Percentage and count of missing values in each column of the bureau dataset

Bar plot to visualize the percentage of missing data for each feature of the bureau dataset

Merging bureau and application_train based on SK_ID_CURR

Correlation matrix for merged dataframe

Observation:

Scatterplot for DAYS_CREDIT and DAYS_ENDDATE_FACT column with respect to the TARGET column

Observation:

Summary of Bureau Balance

Information about the bureau_balance dataset

Statistical summary of the bureau_balance dataset

Percentage and count of missing values in each column of the bureau_balance dataset

Merging bureau_balance and merged bureau/application_train based on SK_ID_BUREAU

Correlation matrix for bureau_balance_merged dataframe

Observation:

Summary of Credit Card Balance

Information about the credit_card_balance dataset

Statistical summary of the credit_card_balance dataset

Percentage and count of missing values in each column of the credit_card_balance dataset

Bar plot to visualize the percentage of missing data for each feature of the credit_card_balance dataset

Histogram of the distribution of the AMT_PAYMENT_TOTAL_CURRENT feature

Observation:

Distribution of AMT_CREDIT_LIMIT_ACTUAL feature

Observation:

Scatter plot to show the relationship between the AMT_CREDIT_LIMIT_ACTUAL and AMT_BALANCE

Observation:

Merging credit_card_balance and application_train based on SK_ID_CURR

Correlation of TARGET feature with credit_card_balance dataset

Correlation of credit_card_balance dataset

Observation:

Summary of Previous Application

Missing Values for Previous Application

Correlation with the Target column

Contract type of the previous applications

Contract Status of the previous application

Was the client old or new

Final credit amount on the previous application

Summary of POS Cash Balance

Missing Values for POS CASH Balance

Correlation with the Target column

Installments left to pay on previous credit vs Target

Contract status during the month

Summary of Installments Payments

Information about the installments_payments dataset

Statistical summary of the installments_payments dataset

Percentage and count of missing values in each column of the installments_payments dataset

Bar plot to visualize the percentage of missing data for each feature of the installments_payments dataset

Pie chart to show the relationship between the AMT_PAYMENT and AMT_INSTALMENT

Observation:

Bar plot to show the total number of counts between the AMT_PAYMENT and AMT_INSTALMENT

Observation:

Distribution of NUM_INSTALMENT_VERSION feature

Observation:

Distribution of NUM_INSTALMENT_NUMBER feature

Observation:

Merging installments_payments and application_train based on SK_ID_CURR

Correlation of TARGET feature with installments_payments dataset

Heatmap of installments_payments dataset with TARGET feature

Observation:

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applications by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
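That bucketing can be sketched with a groupby count plus pd.cut; the bin edges below are an assumption that closes the 5-9 gap in the list above:

```python
import pandas as pd

def bucket_prev_apps(prev_app, id_col="SK_ID_CURR"):
    """Count previous applications per client and bucket the counts.
    Bin edges are illustrative: low [0, 5), medium [5, 40), high [40, inf)."""
    counts = prev_app.groupby(id_col).size().rename("n_prev_apps").reset_index()
    counts["prev_app_level"] = pd.cut(
        counts["n_prev_apps"],
        bins=[0, 5, 40, float("inf")],
        labels=["low", "medium", "high"],
        right=False,  # left-inclusive intervals
    )
    return counts
```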

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, whether in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these features tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables join to the primary table on the key SK_ID_CURR; the tables describing prior loans (e.g., installments_payments, POS_CASH_balance, credit_card_balance) additionally link to previous_application via SK_ID_PREV.

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features could include the number of previous applications per client or aggregates over their previously requested amounts.

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).
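A sketch of such a join, assuming illustrative aggregates (a count of previous applications and the mean AMT_APPLICATION; the columns exist in previous_application, but the choice of aggregates is ours):

```python
import pandas as pd

def add_prev_app_features(apps, prev_app):
    """Aggregate previous_application per client and left-join the result
    onto the application table."""
    agg = prev_app.groupby("SK_ID_CURR", as_index=False).agg(
        prev_app_count=("SK_ID_PREV", "count"),
        prev_amt_mean=("AMT_APPLICATION", "mean"),
    )
    # Left join keeps every application row, even clients with no history
    out = apps.merge(agg, on="SK_ID_CURR", how="left")
    out["prev_app_count"] = out["prev_app_count"].fillna(0)
    return out
```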

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, thereby generating many new (derived) features, and then join (i.e., merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main application table (labeled and unlabeled):
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'

Feature engineering for the prevApp table

The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.

import pandas as pd

# Load data from csv file (pd.DataFrame.from_csv was removed in pandas 1.0)
data = pd.read_csv('phone_data.csv')
# Convert date from string to datetime (day-first format)
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

data.groupby('month', as_index=False).agg({"duration": "sum"})

Pandas reset_index() to convert a MultiIndex to columns

We can simplify the multi-index dataframe using the reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.

Fixing Column names after Pandas agg() function to summarize grouped data

Since the MultiIndex columns carry both the variable name and the operation performed (one per level), we can combine the two levels to name our new columns correctly.
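A sketch of that renaming step on toy data (the month/duration columns echo the phone_data example above):

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 1, 2], "duration": [10.0, 20.0, 5.0]})

# Aggregating with a list of functions yields MultiIndex columns ...
agg = df.groupby("month").agg({"duration": ["sum", "mean"]})

# ... which we flatten by joining the variable name and the operation
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()
print(agg)   # columns: month, duration_sum, duration_mean
```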

For more details and examples on unstacking groupby results, please see here

For more details and examples please see here

feature transformer

Join the labeled dataset

Join the unlabeled dataset (i.e., the submission file)

Processing pipeline

Please see this blog for more details on one-hot encoding (OHE) when the validation/test sets contain previously unseen values.
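With scikit-learn, the usual guard is OneHotEncoder(handle_unknown='ignore'), which maps categories unseen at fit time to an all-zero row instead of raising:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["Cash loans"], ["Revolving loans"]])

# "Car loan" was never seen during fit, so it encodes as all zeros
dense = enc.transform([["Cash loans"], ["Car loan"]]).toarray()
print(dense)
```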

HCDR preprocessing

Baseline Model

To get a baseline, we use some of the features after preprocessing them through the pipeline. The baseline model is a logistic regression model.
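A minimal sketch of such a baseline; the imputer/scaler choices and the toy data are illustrative, not the notebook's exact setup:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

baseline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize features
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy stand-in for the preprocessed application features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

baseline.fit(X, y)
probs = baseline.predict_proba(X)[:, 1]   # probabilities for TARGET = 1
```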

Logistic Regression

Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. By computing the area under the ROC curve, the curve's information is summarized in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Decision Tree

Random Forest

Gaussian Naive Bayes

Comparing all the models visually

The best AUC score is from logistic regression, i.e., 74.89%.

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
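Producing that file from the test IDs and the predicted probabilities is a short pandas step (variable names are illustrative):

```python
import pandas as pd

def make_submission(test_ids, probs, path="submission.csv"):
    """Write a Kaggle submission: one predicted probability per SK_ID_CURR."""
    sub = pd.DataFrame({"SK_ID_CURR": test_ids, "TARGET": probs})
    sub.to_csv(path, index=False)
    return sub

# make_submission(app_test["SK_ID_CURR"], pipeline.predict_proba(X_test)[:, 1])
```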

Kaggle submission via the command line API

report submission

Click on this link

Home Credit Default Risk (HCDR)

Final Project for Spring 2023 course: CSCI-P 556 - Applied Machine Learning